In this notebook, the goal of this environment is to implement 4 policy gradient algorithms: the regular policy gradient (REINFORCE), the policy gradient with a baseline subtracted (REINFORCE with baseline), the one-step actor-critic method, and the advantage actor-critic method. Our agent will play a game in continuous state space, for which we will use a basis function representation to limit ourselves to a finite amount of parameters.
Let's start with a few imports:
%load_ext autoreload
%autoreload 2
%matplotlib inline
from environments.environment2 import *
import matplotlib
import matplotlib.cm
import matplotlib.colors
import numpy as np
import random
from tqdm import tqdm
from scipy.special import *
# reasonable plot size
plt.rcParams['figure.dpi'] = 100
The environment is a 2D grid, with a goal in the center, and a "U"-shaped barrier. Our goal is to learn how to reach the objective in the center regardless of where we start. In particular, we will study how the agent learns to go around the barrier when starting at the bottom of the grid.
Here is what the grid looks like: (the objective is in green, the walls are in red).
env = UMaze()
env.render()
We assume that the state space is continuous, i.e. that the agent can be at any allowed position in the grid. At each position, we can move a fixed distance $0.2$ horizontally, vertically, or diagonally. Hence, there are $8$ possible actions:
We grant a reward of $1$ when the agent reaches the objective. Actions that would cause the agent to hit a wall are ignored.
Note: the agent is not penalized when it makes a wrong move, otherwise it might become scared of walls and get stuck below the U.
To model our policy function for the continous state space, we first need to give each state and each action a value. For that, we will use a grid of $N$ gaussian basis functions with a fixed standard deviation $\sigma$. We will study how these two parameters influence the speed and quality of learning.
Let's define the action value function $q(\vec s, a)$ as follows: $$q(\vec s, a) = \sum_{i=1}^N w_{i,a} \exp \left(- \frac{||\vec s - \vec x_i||^2}{2\sigma^2}\right)$$ where $\vec x_i$ denotes the position of the $i$-th basis function. For our policy function $\pi$, we will use a softmax policy: $$\pi(\vec s, a) = \frac{\exp {q(\vec s, a)}}{\sum_{b=0}^7 \exp {q(\vec s, b)}}.$$ Our goal is to learn the $N \times 8$ weights $w_{i, a}$.
For some algorithms, we will also need to learn a state value function $v(\vec s)$. Once again, we will use basis functions: $$v(\vec s) = \sum_{i=1}^N \theta_{i} \exp \left(- \frac{||\vec s - \vec x_i||^2}{2\sigma^2}\right).$$ For the value function, there are $N$ parameters to learn: the weights $\theta_i$.
The standard policy gradient only uses the policy function and the gradient of the log-policy $\nabla_{w} \ln \pi(\vec s, a)$. Here is the pseudocode:

We can subtract a baseline in the policy gradient algorithm. As seen in the lecture, that should improve the speed of learning. We will use the value function as the baseline, and we will also need its gradient to update it.

In the previous two algorithms, we can only perform the update at the end of the episode. The disadvantages are that we need to wait longer before we can learn, and we also use more memory to store all the past states, actions and rewards. With the one-step actor-critic algorithm, the updates are instead performed in real time.

We also look at the eligibility trace version of the actor-critic algorithm, also known as Advantage Actor-Critic.

Your goal is to implement the algorithms explained above, and observe the following:
Here are a few important notes:
Here are a few core functions that we will use in all 3 algorithms. The ModelParams class is used to store the discount factor $\gamma$, the count and shape of the basis functions, and the weights $w_{i,a}$ and $\theta_i$. Please complete the code as necessary.
class ModelParams:
def __init__(self, npoints=13, gamma=0.99, stdev=0.5):
self.action_weights = np.zeros((npoints * npoints, 8))
self.value_weights = np.zeros(npoints * npoints)
# need to be VERY careful here given the large number of iterations
# if gamma is too small we'll have a hard time updating the policy correctly!
self.gamma = gamma
self.stdev = stdev
self.npoints = npoints
positions = np.linspace(-6, 6, npoints)
x_pos, y_pos = np.meshgrid(positions, positions)
self.x_pos = x_pos.flatten()
self.y_pos = y_pos.flatten()
self.w_cache = None
self.w_cache_pos = None
def get_interpolation_weights(self, position: State) -> np.ndarray:
"""
This returns an array of size N, weight for each i: exp(-||s-x_i||^2/(2\sigma^2))
position: a list of two values: first for x pos; second for y pos
"""
# We do a bit of caching to not compute this every time
if self.w_cache_pos == position:
return self.w_cache
w = np.exp(- ((self.x_pos - position[0]) ** 2 + (self.y_pos - position[1]) ** 2) / (2 * self.stdev ** 2))
self.w_cache = w
self.w_cache_pos = position
return w
def compute_value(self, position: State) -> float:
"""
This computes the v(s) function.
"""
w = self.get_interpolation_weights(position)
v = np.sum(w*self.value_weights)
return v
def get_action_values(self, position: State) -> np.ndarray:
"""
This computes [q(s,a) for all a] for a specific s.
"""
w = self.get_interpolation_weights(position)
q = np.sum(np.expand_dims(w,1)*self.action_weights, axis=0)
return q
def get_action_probabilities(self, position: State) -> np.ndarray:
"""
This computes the [\pi(s, a) for all a].
"""
q = self.get_action_values(position)
pi = softmax(q)
return pi
def get_action_probability(self, position: State, action: Action) -> float:
"""
This computes \pi(s, a).
action: an int of [0,7]
"""
return self.get_action_probabilities(position)[action]
def pick_action(self, position: State) -> Action:
"""
This randomly chooses an action following the policy.
"""
probs = self.get_action_probabilities(position)
# np.random.choice is a bit slow, so we code it by hand
r = random.random()
for i, p in enumerate(probs):
r -= p
if r < 0:
return i
def compute_action_log_grad(self, position: State, action: Action) -> np.ndarray:
"""
This computes \nabla_w \ln(\pi(s,a))
"""
pi = self.get_action_probabilities(position)
w = self.get_interpolation_weights(position)
grad = np.zeros_like(self.action_weights)
grad = - np.outer(w, pi)
grad[:,action] += w
return grad
def compute_value_grad(self, position: State) -> np.ndarray:
"""
This computes \nabla_\theta v(s).
"""
w = self.get_interpolation_weights(position)
return w
def find_starting_pos(env: UMaze) -> State:
"""
This picks a random starting position inside the bounds of the environment.
"""
pos_candidate = None
while pos_candidate is None:
pos_candidate = (random.uniform(-5, 5), random.uniform(-5, 5))
# the two below are easier, and can be used for debugging
# pos_candidate = (0, 4)
# pos_candidate = (random.uniform(-1.5, 1.5), random.uniform(-1.5, 1.5))
if env.is_outside_of_area(*pos_candidate) or env.is_in_objective(*pos_candidate):
pos_candidate = None
return pos_candidate
Here are a few functions to provide nice visualization for the learned value and policy functions.
def render_values(params: ModelParams):
"""
Visualises the v(s) function for all possible states.
"""
xs = np.linspace(-6, 6, 101)
ys = np.linspace(-6, 6, 101)
X, Y = np.meshgrid(xs, ys, indexing="ij")
values = np.zeros_like(X)
for i in range(101):
for j in range(101):
values[i, j] = params.compute_value((xs[i], ys[j]))
#plt.imshow(values, extent=[-6, 6, -6, 6])
cs = plt.contourf(X, Y, values, levels=10, cmap="jet")
plt.xlabel("x")
plt.ylabel("y")
cbar = plt.colorbar(cs)
cbar.set_label("Value")
plt.title("Computed values")
def render_actions(params: ModelParams):
"""
This visualises the most probable action in each state using a colored arrow.
Color of the arrow indicates probability.
Only shows a few discrete points.
"""
colormap = matplotlib.cm.get_cmap("jet")
xs = np.linspace(-5, 5, 21)
ys = np.linspace(-5, 5, 21)
for x in xs:
for y in ys:
action_probabilities = params.get_action_probabilities((x, y))
action = np.argmax(action_probabilities)
max_probability = action_probabilities[action]
plt.plot(x, y, "o", color="black", markersize=0.1)
plt.arrow(x, y, 0.3 * MOVES[action][0], 0.3 * MOVES[action][1], head_width=0.1, head_length=0.1, color=colormap(max_probability), linewidth=1)
plt.xlabel("x")
plt.ylabel("y")
cbar = plt.colorbar(matplotlib.cm.ScalarMappable(norm=matplotlib.colors.Normalize(vmin=0, vmax=1), cmap=colormap))
cbar.set_label("Probability")
plt.title("Preferred actions")
Here is a function that you can use to run a learning algorithm, and display a nice visualization of the results.
from IPython.display import Markdown
def run_algorithm(env: UMaze, params: ModelParams, algorithm, plot_values, num_episodes: int, bucket_size: int = 500, **kwargs):
"""
This function runs an algorithm for a specified number of episodes.
Its arguments are:
* env(UMaze): Instance of the environment
* params(ModelParam): Parameters given to the model.
* algorithm(function that accepts env and params as arguments and returns a Tuple[float, int]): Your algorithm.
* plot_values(bool): If True, it will plot the values using one of the helper functions above.
* num_episodes(int): Number of episodes to play.
* bucket_size(int): Specifies how big groups should be when computing averages, i.e. a size of 500 averages a reward for diplay over 500 episodes.
* Extra arguments specified like "action_rate=0.1" will be passed to the algorithm
"""
assert num_episodes % bucket_size == 0, "num_episodes must be a multiple of bucket_size"
num_buckets = num_episodes // bucket_size
rewards = np.zeros(num_buckets)
lengths = np.zeros(num_buckets)
for i in tqdm(range(num_episodes)):
random.seed(i) # set the seed to the episode id to ensure reproducibility
episode_reward, episode_length = algorithm(env, params, **kwargs)
rewards[i // bucket_size] += episode_reward / bucket_size
lengths[i // bucket_size] += episode_length / bucket_size
plot_points = (np.arange(num_buckets) + 1) * bucket_size
display(Markdown("#### %s (sigma=%.2f, npoints=%d^2)" % (algorithm.__name__, params.stdev, params.npoints)))
plt.subplot(221)
plt.plot(plot_points, rewards)
plt.title("Fraction of episodes where the goal is reached before %d iterations" % 300) # Max iterations is hardcoded to 300 by default
plt.xlabel("Episode")
plt.ylabel("Fraction")
plt.subplot(222)
plt.plot(plot_points, lengths)
plt.title("Episode length over time")
plt.xlabel("Episode")
plt.ylabel("Episode length")
plt.subplot(223)
render_actions(params)
if plot_values:
plt.subplot(224)
render_values(params)
plt.subplots_adjust(wspace=0.2, hspace=0.3)
#plt.tight_layout()
plt.gcf().set_size_inches(12, 8)
plt.show()
First, implement the reinforce algorithm:
def reinforce(env: UMaze, params: ModelParams, action_rate = 5e-2, max_iter = 300) -> Tuple[float, int]:
"""
Run an episode and mutate the params in-place.
:return: tuple containing the (total) reward for the episode, and the number of steps
"""
env.reset()
# Pick random starting position
env._state = find_starting_pos(env)
states = [env._state]
actions = []
rewards = []
iteration = 0
while iteration < max_iter and not env.end:
# Compute action
action = params.pick_action(states[-1])
actions.append(action)
# Do action
state, reward = env.do_action(action)
states.append(state)
rewards.append(reward)
iteration += 1
rewards = np.array(rewards)
for t in range(len(actions)):
G = np.sum(np.power(params.gamma, np.arange(t, len(actions))-t)*rewards[t:])
params.action_weights += action_rate * params.gamma**t * G * params.compute_action_log_grad(states[t], actions[t])
return sum(rewards), iteration
Now, you can perform a few experiments!
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=9), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:24<00:00, 26.03it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=9), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [04:31<00:00, 36.87it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=9), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:10<00:00, 76.75it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=9), reinforce, plot_values=False, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:34<00:00, 106.38it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=9), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:03<00:00, 81.14it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=13), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:26<00:00, 25.90it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=13), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:56<00:00, 56.74it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=13), reinforce, plot_values=False, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:37<00:00, 102.77it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=13), reinforce, plot_values=False, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:30<00:00, 110.86it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=13), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [03:50<00:00, 43.40it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.15, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [07:25<00:00, 22.46it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:17<00:00, 26.51it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:07<00:00, 78.54it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:39<00:00, 100.46it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [01:44<00:00, 95.34it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:04<00:00, 32.80it/s]
Now, copy/paste and modify your reinforce function above to subtract a basline.
def reinforce_with_baseline(env: UMaze, params: ModelParams, action_rate = 5e-2, value_rate = 5e-2, max_iter = 300) -> Tuple[float, int]:
"""
Run an episode and mutate the params in-place.
:return: tuple containing the (total) reward for the episode, and the number of steps
"""
env.reset()
# Pick random starting position
env._state = find_starting_pos(env)
states = [env._state]
actions = []
rewards = []
iteration = 0
while iteration < max_iter and not env.end:
# Compute action
action = params.pick_action(states[-1])
actions.append(action)
# Do action
state, reward = env.do_action(action)
states.append(state)
rewards.append(reward)
iteration += 1
rewards = np.array(rewards)
for t in range(len(actions)):
G = np.sum(np.power(params.gamma, np.arange(t, len(actions))-t)*rewards[t:])
delta = G - params.compute_value(states[t])
params.value_weights += value_rate * delta * params.compute_value_grad(states[t])
params.action_weights += action_rate * params.gamma**t * delta * params.compute_action_log_grad(states[t], actions[t])
return sum(rewards), iteration
Now, you can perform a few experiments!
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=9), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:44<00:00, 28.99it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=9), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [04:20<00:00, 38.32it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=9), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:52<00:00, 58.12it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=9), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:23<00:00, 69.76it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=9), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:39<00:00, 62.72it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:55<00:00, 28.12it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [03:29<00:00, 47.78it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:30<00:00, 66.42it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:04<00:00, 80.56it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [03:12<00:00, 51.98it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.15, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:39<00:00, 25.04it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:52<00:00, 28.36it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [03:12<00:00, 52.03it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:35<00:00, 64.22it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:13<00:00, 75.14it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:10<00:00, 32.23it/s]